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Abstract. To extract the relevant features in a given dataset is a dif¬ 
ficult task, recently resolved in the non-negative data case with the 
Non-negative Matrix factorization (NMF) method. The objective of this 
research work is to extend this method to the case of missing and/or cor¬ 
rupted data due to outliers. To do so, data are denoised, missing values 
are imputed, and outliers are detected while performing a low-rank non¬ 
negative matrix factorization of the recovered matrix. To achieve this 
goal, a mixture of Bregman proximal methods and of the Augmented 
Lagrangian scheme are used, in a similar way to the so-called Alternat¬ 
ing Direction of Multipliers method. An application to the analysis of 
gene expression data of patients with bladder cancer is finally proposed. 


1. Introduction 

Finding a relevant dictionary for extracting the relevant featnres in a 
dataset is a very important task for many applications. The Non-negative 
Matrix Factorization (NMF) is a recent and very efficient method for achiev¬ 
ing this goal in the case of non-negative data. Given a dataset consisting of 
n vectors xi,... ,Xn in W^, the NMF approach consists of bnilding a matrix 
M whose colnmns are xi,... ,Xn and then factorize this matrix as 

M = UV^ + E, 

where E is an error term, U and V are componentwise non-negative, and U 
has a small nnmber of colnmns. The colnmns of U represent the “featnres” 
present in the dataset and the interpretation of this decomposition is that 
each data consists of a mixtnre of the discovered featnres. 

Since its stndy by Lee and Senng |13| in the late 90’s the method, hrst 
explored in the chemometrics commnnity, enjoyed a signihcant gain of inter¬ 
est from many application helds and especially in machine learning. It has 
been snccessfnlly nsed for docnment clnstering [13, email snrveillance [I], 
hyperspectral image analysis CH, face recognition [lO] . blind sonrce separa¬ 
tion 13, etc. It has recently also been applied to microarray data analysis 
|12] and biomedicine M- 
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The NMF has been computed via various approaches. One of the main 
strategy is the alternating minimization scheme, which consists in succes¬ 
sively minimizing in U and then in V. Since minimization in U (resp. V) 
is a convex optimization problem, such a strategy naturally comes to mind. 
Furthermore, it has been proven very efficient in practice. However, no con¬ 
vergence result towards a global minimizer has been provided up to now. 
Moreover, handling the nonnegativity constraints appears to be delicate, 
since the convergence speed of the method depends on the way this constraint 
is incorporated into the iterations. A breakthrough research work concern¬ 
ing the possible convergence guarantees is [7], where it was first proven that 
the problem could be convexified under separability assumptions. Following 
shortly after, |2] proposed an efficient approach based on linear programming 
also relying on separability. Separability is a property that can be summa¬ 
rized by saying that the features are some data vectors already belonging to 
the sample. Recently, under similar assumptions, [8] proposed a very simple 
approach based on successive projections. These impressive results apply 
to a large set of problems where separability holds. However, separability 
does not hold in very important cases, and there is still a lot of work to 
do in order to explore the possible performance guarantees of algorithms 
for NMF in the future. Back to the not necessarily separable case, in 
and |15) . authors proposed Bregman divergence based iterative methods for 
NMF. Bregman-divergence based proximal approaches have been the subject 
of great interest recently due to good practical performances and connection 
with mirror descent type algorithms, whose theoretical complexity has been 
extensively studied in the computational optimization community (see for 
instance the survey 0) 

The approach proposed in the present article relies on Bregman-proximal 
iterations. Our goal is to extend the method to the case where data may 
be missing and/or corrupted by the occurrence of outliers. Our approach 
borrows ideas from robust PCA |3], where the matrix to approximate is 
decomposed into a low rank part and a sparse part: 

M = L + S. 

The low rank part L is intended to approximate the data set which is sup¬ 
posed to be of low rank, and the sparse part S represents the outliers. In 
the seminal article @], the noise is not taken into account. However, in 
datasets such as gene expression data, the noise may be very large and one 
has to search for a low rank solution that removes the noise at the same 
time. This is done in the code GoDec for instance, in the case of robust 
PCA, see [9] for further information on this method. In the present work, 
we propose an efficient method that denoises the data, guesses the missing 
values, and detects the outliers in the matrix M while performing a low-rank 
non-negative matrix factorization of the recovered matrix. For this purpose, 
we use a mixture of Bregman proximal methods and of the Augmented La- 
grangian scheme as it is used in the Alternating Direction of Multipliers 
Method (ADMM). This mixture is also justified by the very informative re¬ 
cent work |16) . which presents a clear interpretation of the ADMM in terms 
of proximal method-type iterations. 
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In the next section, the Bregman proximal scheme is presented and in 
Section a version taking into account potential outliers and/or missing 
data is described in full details. Then the choice of the relaxation parameters 
is discussed in Section An application to the analysis of gene expression 
data of patients with bladder cancer is proposed in Section 

2. A Bregman proximal scheme for Non-negative Matrix 

Factorization 

Let /i be a strictly convex real valued function. Assume that h is con¬ 
tinuously differentiable and dehned on a closed convex set C. Then, for all 
x,y €C, the Bregman divergence associated to h is given by 


( 2 . 1 ) 


Dh{y, x) = h{y) - h{x) - {Vh{x), {y - x)). 


2.1. The space alternating Bregman-proximal scheme. In this sec¬ 
tion, we will consider the following Bregman-proximal algorithm, which al¬ 
ternates minimization in the variable U and minimization in the variable 
F: 

(2.2) ^ argmin[;gRdxn M - + pDh{U,U^^^) 

F 

(2.3) ^ argminygK<ix» M - V/5T>/,(F, F^^)) 

F 

where is the Bregman’s divergence associated with h{x) = a:ln(x), 

so we obtain 


(2.4) 


Dhiy,x) = xln{ -] + y - X, 


and /? is a positive constant. Let us consider the problem 

4%a)inf,gR.x. 1\\m- + p In -{u- . 

The gradient of 4>{U) is given by 

V(/(t/) = -(M-[/F*)F. 

Let us now compute the gradient of (p{U) defined by (p{U) = Dh{U,U^^^). 
A straightforward computation gives 

"Fc = 


dU, 




Therefore, taking one step in our Bregman-penalized subpace method sums 
up to solving 

(M-17F*)F = plnf^V 


Since no explicit solution to this decoupled system of equations, we will use 
a hxed point approach defined as follows. 

(1) Take t/(^+bO) = 
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(2) define 


(2.6) = exp Q[(M - . 


(3) Stop when the difference between two successive iterates is suffi¬ 
ciently small, e.g., less that le-3. Denote by I* the iteration number 
when this occurs and output 

The iterate can be obtained from using the same approach. The 

corresponding optimization problem associated to step A: -|- 1 is 


argmiuygj^nxr 


1 

2 





2.2. A toy numerical experiment. We start with a simple random ex¬ 
ample programmed in Matlab. Let Uq be a random matrix in with 

i.i.d. components having the uniform distribution on [0,1]. Let Vq be a ran¬ 
dom matrix in with components having the same distribution. Take 

M = UqVq, p = 100, and random initial matrices. Figure shows that the 
method converges to M in the sense that it produces a sequence of matrices 
and whose product converges to M. 



Figure 1. Evolution of the error M — as k goes 

from 1 to 100 on a random example. 


3. The case of outliers and missing data 


Let D denotes the set of couples {i,j) for which an observation of Mij is 
available. The matrix factorization problem can be addressed by considering 
the following optimization problem 


(3.7) 


subject to 


min 

Y,s,uy^o 


2 E {y^,J-{uv\Jf+\\\s\\, 


(3.8) M = y + 5, 

with II= '^ij I'S'ijI, and A is a relaxation parameter whose value is dis¬ 
cussed in Section H] 
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3.1. The augmented Lagrange function. The Lagrange function L{Y, 5, U, V, A) 
for our problem is equal to: 

(3.9^ ^ {Yij-{UV%f + X\\S\\^+ ^ -Yij - S^j). 

{i,j)en (*J)6G 

In order to enforce the constraint Mij = Yij — Sij for all {i,j) £ we 
introduce the following augmented Lagrange function 

(3.mn(y,5,[/,y,A) = L{Y,S,U,V) + p^ iM^J-Y,,-S,,f. 

We now introduce an Alternating Direction Method of Multipliers. This 
method consists of solving iteratively in all the variables one after the other, 
and then updating the dual variable. For this purpose, we compute in the 
next subsections the optimum value for the problem of minimizing the aug¬ 
mented Lagrange function with respect to each variable. 

3.2. Individual minimization subproblems in Y, S, U, and V. 


3.2.1. Minimization with respect to Y. The problem reduces to minimizing 
the function of the variable Y given by 

2 ~ {UV )ij) + Aij{Mij — Yij — Sij) + p— (Aftj — Yij — Sij) 

i,j {i,j)en ii,j)en 


Let us denote by Y* a solution to this problem. We have to consider two 
cases separately: either (i,j) G D, or {i,j) ^ D. The case (i,j) ^ D is 
obvious, since it is straightforward to check that Y*j = (UV^)ij is a solution. 
Setting the partial derivative to zero gives the result of the case (i,j) G D. 
To summarize, we obtain 


(3.11) Y*^ 


+ A,, + p{M,, - 5.,)), if(z, j) G D, 
_ {UV^)ij otherwise. 


3.2.2. Minimization with respect to S. We have to minimize the function of 
S given by 

(3.1|)S||i+ ^ Aij{Mij-Yij-Sij) + p^ {M^j-Y,j-Sijf. 

This can be performed by optimizing each component of S independently of 
the other. As for the case of minimizing with respect to Y, we distinguish 
between two cases, while if {i,j) ^ D one easily checks that S*j = 0. If 
(i,j) G D, we will use the following result. 

Theorem 3.1. The solution to 

(3.13) min-(y — x)^-|-A |x| 

xSM 2 
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is given by 


(3.14) 


= 


y - A if y > X, 
y + X if y < X, 
0 otherwise. 


Based on this result, we easily obtain 

—Mij + Yij + Aij — ^,if — Mij + Yij + Aij > 


(3.15) S*j = 


—Mij + Yij + Aij + - ,if — Mij + Yij + Aij < 

0 otherwise. 


3.2.3. Minimization with respect to U and V. We just have to use the fixed 


point subroutine given by (2.6). 


3.3. Our Bregman proximal-type ADMM. We will choose the starting 
values as follows. Set = 0. Set 

yo ^ iMij,\/{i,j)en, 

I imputation by the mean over all other observed values row i,\/{i,j) ^ fl. 

The Bregman-Proximal point ADMM is then given by 

(1) Set 5 = U = A = A(^) and obtain Y^^i) = y* 

given by (3.11 1, 

(2) Set Y = Y^^\U = D = V^^fA = A(^) and obtain S^Mi) = 


S* given by (3.15), 

(3) Set Y = ^ ^ ^(fc) jj{k+i) 

using the fixed point method (2.6) 


(4) Set Y = = 5 (fc+i)^[/ = [/(fc+i), A = A^^'^ and obtain U^Mi) 


using the fixed point method (2.6). 
(5) Set A^MI) ^ ^{k) ^m-Y -S 


4. Choosing the A value 

The choice of the parameter A is crucial for the good performances of 
the proposed method. We performed a selection of A using the following 
approach. 

(1) Propose an a priori range of values for A such that its maximum 
values leads to S equals the null matrix at optimality. 

(2) For each value of A, select a set 5 of s entries chosen uniformly at 
random in M and consider them as missing data temporarily. 

(a) Find the solution of (3.7), where the missing data incorporate 


the set of entries which were artificially declared as missing in 
the previous step. 

(b) Compute the average squared error on the data artificially de¬ 
clared as missing. Denote this quantity by err a. 

(3) Select A in the prescribed range as the one which minimizes the 
average squared error err a. 
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5. Application to bladder cancer expression data 

5.1. Description of the data. The number of new patients affected by 
bladder cancer in 2013 attained 10,000 in France, thus improving the diag¬ 
nosis of bladder cancer is a Public Health priority. Determining the genes 
responsible for bladder cancer would undoubtedly permit to design an effi¬ 
cient and adapted set of medical treatments. 

First of all the treatment should depend on the advancement of the tumor. 
For this purpose, researchers have gathered important gene expression data 
and the corresponding state of the malignant tumor in the bladder of 100 
patients in the Lyon region (France). This study concerns 34 selected genes 
and the tumor state was discretized into 4 classes. 

From the statistical perspective, the data can be analyzed using PCA, 
cluster analysis, and polytomic logistic regression. Our approach in this 
section is to use Non-negative Matrix Factorization in order to jointly take 
into account the data’s intrinsic non-negative nature and the necessity of 
clustering the data by performing efficient feature extraction. One of the 
main challenges in the study of such data sets is to take into account possible 
outliers. For the bladder cancer dataset, some outliers have been observed by 
using standard PCA visualization, thus enforcing the need to automatically 
detect such phenomena in order to avoid subsequent misinterpretations of 
the genes’ respective influences on the tumor state. 

The data array consists of one first column providing the tumor state. 
The next 34 columns provide the expression of 34 genes. The array has 100 
lines which correspond to the number of patients. There are two principal 
classes of tumors: 

• TVNIM : noninfiltrating tumors; 

• TVIM : infiltrating tumors. 

The tumor states have been classified into the following groups: 

• Ta : noninfiltrating tumor in Urothelium; 

• Tla : noninfiltrating tumor in Urothelium and parts of the chorion; 

• Tib : noninfiltrating tumor in Urothelium and the full chorion; 

• > T1 : infiltrating tumor. 

In the standard classification, the last group of the list incorporates states 
T2 to T4b. 

5.2. Experimental results. The ADMM algorithm was run on the exper¬ 
imental data. The choice of A was obtained using the strategy described in 
Section]^ The result obtained by this strategy is depicted in Figure 

Based on the optimal choice of A, the algorithm performances and the 
estimation results are depicted in Figureand Figure]^ 

The first subplot of Figure represents the matrix S after convergence, 
while the second subplot is the matrix U*. The third subplot, for its part, 
represents the distance in Frobenius norm between two successive iterates of 
A. Finally, the fourth subplot represents the evolution of the relative error 
between M and its NMF . 

Figure|^shows the cluster index in each subgroup "pTa", "pTla", "pTlb", 
and "more serious than pTl" (ie., ">pTl"). We see a sort of continuous 
drift in these cluster indices from pTa to >pTl. Indeed, 
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Figure 2. The average squared prediction error on the arti¬ 
ficially declared as missing entries as a function of A 




10 20 30 40 so 60 70 80 90 100 



Figure 3. The factorization and convergence curves. The 
first subplot is S after convergence. The second subplot is V^. 

The third subplot, shows the distance between two successive 
iterates of A. The fourth subplot shows the relative error 
between M and its NMF as a function of iteration 

number 

• pTa mostly consists of 3 subgroups indexed by {1,4,6}; 

• pTla mostly consists of 3 subgroups indexed by {4, 5}; 

• pTlb mostly consists of only one group, which is {5}; 

• finally, >pTl mostly consists of 5 subgroups indexed by {2,3,5, 7, 8}. 
The intersection of the index subsets between two adjacent states is always a 
singleton, up to a discarded minority of individuals. The cluster indexed by 
5 appears at medium to serious levels. The lowest level is characterized by 
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cluster 6 while the most serious level is characterized by the more significant 
appearance of clusters 2,3,7, and 8. 

Some more precise investigations should now be performed in order to 
understand the biological meaning of these clusters, i.e., to understand the 
factors of gravity in this cancer. 
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Figure 4. Cluster index for each group of patients. Subplot 
1 corresponds to pTa, subplot 2 to pTla, subplot 3 to pTlb, 
and subplot 4 to > pTl. 

6. Conclusion 

In this article, a new way to find a relevant dictionary for extracting 
the relevant features in a given dataset has been presented, in an original 
context of missing values and outliers. The well-known Non-negative Matrix 
Factorization (NMF) method has been extended on denoised data, where 
missing values have been guessed and outliers have been detected, leading to 
a mixture of Bregman proximal methods and the of Augmented Lagrangian 
scheme. Finally, an application to the analysis of gene expression data of 
patients with bladder cancer has been provided for illustration purpose. 
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